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^ Abstract 

, ^, We describe three approaches for computing a gravity signal from 

a density anomaly. The first approach consists of the classical "sum- 
^7^ mation" technique, whilst the remaining two methods solve the Pois- 

,__! son problem for the gravitational potential using either a Finite Ele- 

lO ment (FE) discretization employing a multilevel preconditioner, or a 

Green's function evaluated with the Fast Multipole Method (EMM). 
The methods utilizing the Poisson formulation described here differ 
from previously published approaches used in gravity modeling in 
that they are optimal, implying that both the memory and compu- 
^^ tational time required scale linearly with respect to the number of 

^ unknowns in the potential field. Additionally, all of the implementa- 

k> tions presented here are developed such that the computations can be 

^ performed in a massively parallel, distributed memory computing en- 

vironment. Through numerical experiments, we compare the methods 
on the basis of their discretization error, CPU time and parallel scal- 
ability. We demonstrate the parallel scalability of all these techniques 
by running forward models with up to 10^ voxels on lOOO's of cores. 



1 Introduction 

1.1 Background 

The use of forward models to compute synthetic gravity signals is necessary 
to conduct inversions of the subsurface density structure. Given a volume 
Qm over which we have a density field p(x), the gravity attraction at a point 
r = (r, s, t) due to this body can be computed via 



An alternative way to compute the gravity field is to solve the gravitational 
potential equation 

V20 = -47rGp(x) in fi^o, (2) 

where is the potential, G is the gravitational constant, Q^o denotes the 
entire free space and we assume that p(x) =0, Vx ^ Qm- The potential is 
subject to the following Dirichlet boundary condition 

= 0, at X = oo. (3) 

The gravity field sought is given by the gradient of the potential 0; 

g(x) = -V0. (4) 

The physical model is depicted in Fig. [T] Forward gravity models typically 
fall into one of two categories: summation based techniques which evaluate 
Eq. ([I|, or partial differential equation (PDE) based techniques which solve 
the gravitational potential formulation in Eqns. (|2])-(|4]). 

The summation methods require the subsurface density structure to be 
discretized into a set of volumes. At each location r, in the model domain 
where a gravity signal is sought, the gravitational contribution from each 
density element in the domain is evaluated using Eq. ([I]) and summed. The 
summation methods differ in the manner in which the integral expression 
in Eq. ([T| is evaluated. Several analytic approaches exist in which a closed 
form expression for Eq. ([I| is used in either Cartesian (see Li and Chouteau 



(1998[) for an overview) or spherical coordinates (Johnson and Lithehiser 



1972 Smith et al. 2001[). The limitation of analytic expression is that one 



is forced to choose a spatial discretization for the density structure which is 




Figure 1: Problem domain for computing gravity. Here we denote the infinite 
domain boundary by Sfioo, the model domain by Q and density anomaly 
domain by Q^j. The center of mass of Qm is denoted by ro and n is the 
outward pointing normal to the boundary of Q. 



orthogonal to the coordinate system, and the density is usually required to 
be constant over each element. The complexities and discretization restric- 
tions of the analytic method can be overcome by using a sufficiently accurate 
quadrature scheme to approximate Eq. ([I]). This approach permits any spa- 
tial discretization to be used provided a high accuracy quadrature rule can be 



defined over the geometry of each cell used in the discretization ( Asgharzadeh 



et aH|2007[ ). 

Recently there has been some interest in using PDE based approaches to 
compute gravity anomalies, as these methods have been demonstrated to be 
both faster and produce more accurate forward models than the summation 



techniques. In Cai and Wang (2005), a finite element method was used to 
obtain the solution to the Poisson equation. They favoured the finite ele- 
ment method over the finite difference method as the former allowed more 
geometric freedom in meshing the density anomalies and the formulation eas- 
ily permitted a variable density field within each voxel. Their formulation 
utilized a Robin type boundary condition to approximate the boundary con- 
dition in Eq. pi). The method was regarded as being "fast" since within a 
finite size domain, the Robin condition yielded a smaller error than setting 
= on the boundary of a finite domain. That is, the convergence of the 
error using this method was faster than simply setting = on the boundary 



of the finite domain. In contrast, Farquharson and Mosher (2009) employed 
a finite difference discretization to solve Eq. ([2]), where the boundary condi- 
tion = at X = oo is approximated by ensuring that the model boundaries 
are "far" from the density anomaly, which in their work constituted using a 
model domain with side lengths six times larger than the side length of the 
anomaly. 

The development of fast and efficient forward models is crucial to enable 
high resolution inversion to be performed. In considering the computation 
complexity of the summation algorithm, we see that if we discretize the 
domain with A^ density elements and we have M measurements, i.e. loca- 
tions where we will evaluate the gravity, the calculation will require 0{MN) 
time. Given the ease with which gravity measurements can be made on a 
regional scale using either a land-based relative gravimeter or via airborne 
measurements, or on a global scale using satellite based gravimetry, applied 
geophysics studies may typically have values of M on the order of 10,000. The 
number of measurements M is continually increasing as new techniques are 
developed, or existing techniques become affordable or automated. We note 
that the computational cost of evaluating the gravity contribution from one 



element via Eq. ([T| is not insignificant. Even the simplest 1-point quadra- 
ture rule requires: 5 additions, 7 multiplications and one square root, which 



is equivalent to the cost of ~ 20 multiplications (Fog, 2011). 

Using the PDE approach, one obtains the value of the potential over the 
entire domain, from which the gravity can be computed as a post-processing 
task. Consequently, the PDE approaches have a computational complexity 
which is not a strong function of the number of evaluation points, but instead 
is dominated by the complexity of the linear solver (X) used to obtain the 
potential, i.e. the overall method scales according to 0{N + X). If sparse 
direct factorizations (such as Cholesky or LU) are used, the solve time will 
scale like X = O^n^^"^) in 2D and X = 0{'n?) in 3D, where n is the number of 
unknowns used to represent the discrete potential field. The memory usage 
for these solvers is ~ 0{n\ogn) and ~ 0{n^^^) for 2D and 3D respectively 



(Li and Widlund, 2007). If unpreconditioned Krylov methods like conjugate 
gradient are used, the solve time will scale according to X = 0{'n?/'^) and 
(9(n^/^) in 2D and 3D respectively. Numerous optimal multilevel precon- 
ditioners exist for the Poisson equation in which both the solve time and 



memory usage will scale like 0{n) (Trottenbert et al. , 2001). 



1.2 Present work 

Here, we examine several variants of the summation method, a finite element 
method with two types of boundary conditions and a fast multipole method 
to compute synthetic gravity fields. Our examination of the different meth- 
ods focuses on the accuracy and the algorithmic complexity (optimality) of 
the techniques. All of the methods used in this study are developed to be 
executed on massively parallel, distributed memory computer architectures. 
We also examine the parallel performance (scalability) of the three classes of 
the methods under consideration. 



2 Numerical techniques 

2.1 Summation 

We considered three variants of the summation technique in this study. Each 
of the summation techniques is defined in a Cartesian coordinate system and 
utilized a structured mesh of hexahedral cells to discretize the density field. 



The model domain considered was always "brick" shaped and thus was easily 
decomposed into a set of Mx x My x M^ cells . Within each cell, the density 
is assumed to be constant. The first summation approach (which we identify 



as SUM- an) uses the analytic expression from Li and Chouteau (1998) to 
evaluate the vertical component of the gravitational contribution gz{^)i given 
by Eq. ([T|. The other two methods we consider use either a one point Gauss 
(SUM-Gl), or a two point Gauss (sum-g2) quadrature scheme to evaluate 
the gravity integral. 

Parallelism is achieved in the summation methods via a spatial decom- 
position of the mesh used to discretize the density field. The locations where 
the gravity field is required to be evaluated are duplicated on each processor. 
Every processor calculates a local gravitational contribution at each evalu- 
ation point from a subset of cells within the entire domain. This operation 
can be completed without any communication. The only communication 
required is a global reduction of the local gravity contributions from each 
processors local subdomain. 

2.2 Finite element method 

The Poisson equation in Eq. (p|) is solved using a standard Galerkin Finite 



Element (FE) formulation (Hughes, 1987). The variational form is given by 



f vV^cpdV = AnG f vp{x)dV, (5) 

J i2oo •^ * *oo 



where w is a test function which vanishes on all Dirichlet boundaries. Apply- 
ing integration by parts to the second order derivative in Eq. ([S]), we obtain 

- f Vv.V(pdV+ I vV^.ndS = AnG f vp{x)dV. (6) 

Here we consider using two different approaches to approximate the "Dirich- 
let at infinity" boundary condition in Eq. ([s]). Both methods first approxi- 
mate the entire free space domain VL^, by a finite sized domain VL, satisfying 
^M ^ ^- The first approximation of Eq. ([s]) we consider simply requires that 

0|af7 = O, (7) 

where dVt denotes the boundary of Vt. Clearly, the larger the domain Vt is 
compared to the domain of the density anomaly fijv/, the better the approx- 
imation. We will denote this particular boundary condition approximation 

as FEM-D. 
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The second approximate boundary condition we considered was intro- 



duced by Cai and Wang (2005) and consists of approximating the far field 
gravitational attraction on a finite sized domain Q. The far field gravity is 
approximated according to 



\dn 



(8) 



dn 



where r^ = xl^j^— fq and tq is the centroid of the density anomaly domain Qm- 
These quantities are indicated on Fig. [TJ Using the definition of the potential 
from Eq. Q, we can introduce Eq. (pi) naturally into the variational problem 
in Eq. (|6]) as a Robin boundary condition. We denote this boundary condition 
approximation as fem-gt. For a thorough description of the finite element 
formulation and the implementation of the Robin boundary conditions, we 



refer readers to Cai and Wang (2005). 



As in the summation method, the domain consisted of a brick like ge- 
ometry and was discretized with M^ x My x Mz hexahedral elements. The 
discrete solution for was represented with piecewise trilinear (Qi basis) 
functions over each hexahedral element. The same mesh was used to define 
the density structure. In the FE implementation used here, the density was 
assumed to be constant over each element. The resulting discrete problem 
from the FE discretization yields the sparse matrix problem 



fL + Fl X 



(9) 



where x, b represent the discrete potential and force term, L is the discrete 
Laplacian and F is the term associated with the far field boundary condition 
appearing in the surface integral in Eq. ([6]). We note that F = when the 
FEM-D approach is used. 

Following the solution of Eq. ([o]), we compute the gravity within each 
element by interpolating the gradient of the trilinear basis functions used to 
approximate 0. This approach has the disadvantage that the gravity field 
computed is discontinuous across element boundaries. The reconstruction of 
a continuous C° nodal field from the gradient of a finite element solution is a 
thoroughly studied problem. The Super Convergent Patch Recovery (SPR) 



(Zienkiewicz and Zhu 1992) and the Recovery by Equilibrium of Patches 



(REP) (Boroomand and Zienkiewicz, 1997) are both appropriate techniques 



to recover an accurate nodal gravity field. In Cai and Wang (2005), a nodal 



gravity field was computed using a global L2 projection. A local L2 projection 



can also be used (Hughes, 1987), which has the advantage of not requiring the 
solution of a global matrix problem. In practice, to enable the gravity field 
to be evaluated everywhere, a continuous gravity field defined on the nodes 
of the finite element mesh is the most useful representation. In this work 
however, we only use the results of the gravity field to compute error norms, 
for which the element wise, discontinuous representation of the gravity field 
is sufficient. 

The matrix problem in Eq. (|9| was solved using FGMRES (Saad, 2003), 
preconditioned with one V-cycle of geometric multrgrid (GMG). The GMG 



preconditioner we used is fairly standard and we refer to Briggs et al. (2000); 



Wesseling (1992) and Trottenbert et al. (2001) for an introduction to these 



methods. Here we briefly summarize the components used in our multgrid 
preconditioner. 

The multigrid method utilizes a mesh hierarchy consisting of rii levels. 
Each level in the hierarchy defines a mesh of different spatial resolution. In 
the results presented here, a grid refinement factor of two was used between 
each grid level. The mesh at level rii has the finest resolution and represents 
the mesh used to discretize the potential field problem. The operator A = 
L + F was defined on each mesh within the hierarchy by re-discretizing the 
PDE. Trilinear interpolation was used to define the restriction operator R, 
which is required to project nodal fields from a fine grid, to the next coarsest 
grid. Interpolation of fields from a coarse to fine grid was given by R"^. On 
every grid level except the coarsest, we employed A^^ Richardson's iterations, 
combined with a Jacobi preconditioner as our smoother. Given a vector y^ at 
iteration k, the application of the smoother is given by the following sequence 



yfe+i 



yfc + diag(A) Mb -Ay, 



(10) 



Unless otherwise stated, Nj. = 2 was used in all experiments. On the coarsest 
grid level, the smoother was defined via an LU factorization. 

In our Poisson solver, the action of Ay^, required by the smoother in 
Eq. (10) (on all grid levels expect the coarsest) and during each FGMRES 



iteration (finest grid only), was defined in a matrix-free manner. Similarly, 
diag(A) was computed element-by-element, without explicitly assembhng the 
full stiffness matrix A. On the coarsest grid, A was explicitly assembled to 
allow an LU factorization to be performed. 

At each iteration i of the Krylov method, we monitor the 2-norm of the 
residual r,- = b — Ax,-. The current estimated solution x,- obtained from the 



iterative method was deemed to be converged if ||rj||2 < 10^^°||ro||2, where 
ro is the initial residual. 

Support for parallel linear algebra, Krylov methods and the structured 
mesh representation were provided by the Portable Extensible Toolkit for 



Scientific (c)omputation (PETSc) (Balay et al. , 2010) 



2.3 Fast multipole method 

The Fast Multipole Method (FMM) is an algorithm that accelerates the 
solution of an iV-body problem, 

N 

g(x;.) = J]p.K(x;,x,), (11) 

i=l 

which is simply a discrete form of Eq. ([I]). Here, g(x') represents the grav- 
itational field evaluated at a point x'-, where the field is generated by the 
influence of sources located at the set of points {xj}. The sources are often 
associated with particle-type objects, such as charged particles, or in this 
case rock masses. In summary: {x'} is a set of evaluation points; {xj} is a 
set of source points with densities given by pf, and K(x', x) is the kernel that 
governs the interactions between evaluation and source particles. The kernel 
for the gravitational interaction in three dimensions is given by 

K(x;,x,) = ^f^. (12) 

|x' — x| 

Obtaining the field g at all the evaluation points requires in principle 0{MN) 
operations, for N source points and M evaluation points. The fast multipole 
method obtains g approximately with a reduced operation count, 0{M + N). 
In the FMM algorithm, the influence of a cluster of particles is approx- 
imately represented by a series expansion, which is then used to evaluate 
far-away interactions with controllable accuracy. To accomplish this, the 
computational domain is hierarchically decomposed, allowing pairs of sub- 
domains to be grouped into near and far, with far interactions treated ap- 
proximately. Fig. |2] illustrates such a hierarchical space decomposition for a 
two-dimensional domain, associated to a quadtree structure. 



Using this decomposition of the computational domain, the sum in Eq. (11 ) 



can be decomposed as 

A^ncar -'Vfar 

g(x;.) = J2 pM^'j,^k) + 5^PfcK(x;,x,). (13) 

k=l k=l 

The first term, corresponding to the near field of an evaluation point, will 



have a small fixed size independent of A^. The second sum of Eq. (13), 
representing the far field, will be evaluated efficiently using a series approxi- 
mation so that the total complexity for the evaluation is 0{N). We will use 
the following terminology for our field approximations: 

Multipole Expansion (me): is a p term series expansion that represents 
the infiuence of a cluster of particles at distances large with respect to 
the cluster radius. 

Local Expansion (le): is a p term series expansion, valid only inside a 
subdomain, used to efficiently evaluate a group of ME s locally in a 
cluster of evaluation points. 

The center of the series for an ME is the center of the cluster of source 
particles, and it converges only outside a given radius centered at the cluster 
of particles. In the case of an LE, the series is centered near an evaluation 
point and converges only inside a given radius. 

The introduction of a single representation for a cluster of particles, via 
the multipole expansion, effectively permits a decoupling of the influence of 
the source particles from the evaluation points. This is a key idea, resulting 
in the factorization of the computations of me's that are centered at the 
same point, so that the kernel can be written 

p 

K(x;,x,) = ^c™(x,)/^(x;.) (14) 

m=0 

This factorization allows pre-computation of terms that can be reused many 
times, reducing the complexity of evaluation from 0{N^) to 0{NlogN). 
Similarly, the local expansion is used to decouple the influence of an ME from 
the evaluation points. A group of me's can be factorized into a single LE, 
which allows the 0{N log N) complexity to be further reduced to 0{N). By 
representing me's as le's one can efficiently evaluate the effect of a group of 
clusters on a group of evaluation points. 
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(a) Domain decomposition. 
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(b) Near and Far field. 



Figure 2: Quadtree decomposition of a two-dimensional domain: (a) presents 
a hierarchical tree related to the full spatial decomposition of the domain; 
(b) presents a colored two-dimensional spatial decomposition for interacting 
with particles in the black box, and its equivalence on the tree. The near-field 
is composed by the dark yellow boxes and the black box itself, while the far- 
field is composed by the dark red colored boxes. Notice that the far-field is 
composed of boxes of different levels of the tree structure. The relationships 
between the nodes of the tree simplify the process of composing the near and 
far domains. 
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Hierarchical space decomposition 

In order to make use of the ME and LE, the domain must be decomposed 
into near and far subdomain pairs. A hierarchical decomposition provides an 
efficient implementation for this operation. The hierarchical subdivision of 
space is associated to a tree structure {quadtree structure in two dimensions, 
or an octree structure in three dimensions) to represent each subdivision. The 
nodes of the tree structure are used to define the spatial decomposition, and 
different scales are obtained by looking at different levels. Consider Fig. 2(a)[ 



where a quadtree decomposition of the space is illustrated. The nodes of the 
tree at each level cover the entire domain. The domain covered by a parent 
box is further decomposed into smaller subdomains by its child nodes. As 



an example of its use in FMM, consider Fig. 2(b) where the near-field for the 
black colored box is represented by the dark yellow colored boxes, and the 
far-field is composed by the dark red colored boxes. 
Overview of the algorithm 

We use a diagram of the tree structure to illustrate the whole algorithm 
in one picture. Fig. |3| The importance of this presentation is that it relates 
the control fiow and computation to the data structure used by FMM. 

After the spatial decomposition stage, the FMM can be summarized in 
three stages: the upward sweep, the downward sweep, and field evaluation. In 
the upward sweep, me's are constructed for each node of the tree. For each 
leaf node, me's are derived for each particle. On succeeding levels, these 
expansions are translated to the center of the parent node and combined. 
This is shown in Fig. |3] by the black arrows going up from the nodes on the 
left side of the tree. In the downward sweep phase, me's are first transformed 
into le's for all the cells in the interaction list of a given box. This process 
is represented by the dashed red-colored arrows in Fig. |3} For a given cell, 
the interaction list corresponds to the cells of the same level that are not 
nearest neighbors, but are children of the nearest neighbors of its parent 
cell. After this series transformation, the le's of upper levels are translated 
to the centers of child cells, and their infiuence is summed to obtain the 
complete far-field for each leaf cell. This process is represented by the dashed 
blue-colored arrows going down the right side of the tree in Fig. [3] At the 
end of the downward sweep, each box will have an LE that represents the 
complete far-field for the box. Finally, during the field evaluation phase, 
the total field is evaluated for every particle by adding the near-field and 
far-field contributions. The near field is obtained by directly computing the 
interactions between all the particles in the near domain of the box, consisting 

12 




fee] 



Create Multipole Expansions. 
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Evaluate Local Expansions. 
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Figure 3: Overview of the FMM algorithm. The diagram illustrates the 
upward sweep and the downward sweep stages on the tree. The follow- 
ing operations are illustrated: p2M-transformation of particles into me's 
(particle-to-multipole); M2M-translation of me's (multipole-to-multipole); 
M2L-transformation of an ME into an LE (multipole-to- local); L2L-translation 
of an LE (local-to-local); L2p-evaluation of le's at particle locations (local- 
to-particle). 



of nearest neighbor cells in the tree. 



In this work, we used the open source PetFMM package (Cruz et al. , 2010) 



to calculate the fast multipole operation in parallel. The PetFMM library 
was designed to offer both high serial performance and scalability, but also 
to be easily integrated into existing codes. The serial code is completely 
reused in the parallel setting so that we are never required to maintain two 
versions of the same algorithm. PetFMM leverages existing packages to keep 
its own code base small and clean. Parallel data movement is handled by 



the Sieve package (Knepley and Karpeev 2009) from PETSc (Balay et al 



2010, 2011), while load and communication are balanced using a range of 



different partitioners. In this work we employed either a simple geometric 
based partitioner which sub-divides the space into N^ x Ny x N^ cubes, or 



the graph partitioner ParMETis (Karypis and Kumar, 1998 Karypis, 2011) 
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3 Numerical experiments 



To understand the discretization error and CPU time required by each of the 
different classes of forward models, we considered a synthetic gravity model 
for which we have an analytic solution for the vertical gravity component 
Qz- The model domain VL consisted of a cube with side lengths L = 600 m, 
orientated such that fi = [0,600] x [0,600] x [-450, 150] m. Located at the 
centre of the domain was a cube with side lengths H = 100 m, to which 
we assigned the density, p = 2000 kg/m^. The surrounding material in the 
remainder of the domain was regarded as void and assigned a density, p = 



kg/m . The model setup is identical to that used in Farquharson and Mosher 



(2009). By regarding the dense cube as a simple prism, the analytic gravity 



field can be computed using the closed form expression of |Li and Chouteau 
( 1998[ ). The model setup and the analytic gravity field component gz is shown 
in Fig. m 



3.1 Discretisation error (convergence) 

The calculations for each numerical method used a mesh comprised of hexa- 
hedral elements. The number of elements in each direction was chosen such 
that the density anomaly was exactly resolved by the hexahedral elements. 
Hence, the error we measure from each method does not include any error 
due to the discretisation of the density field. We quantify the error in the 
vertical component of the gravity field Qz, using the Li norm 



Ei= f\gzi^)-g':i^)\dV, 
Jn 



the L2 norm 



and the L„o norm 



Eo 



\gz{^)~g':{^)\'dV 



1/2 



Eoo = max|^^(x) -^^(x)|. 
xesz 



(15) 



(16) 



(17) 



Here gz is the exact gravity computed via the analytic solution from Li and| 
Chouteau (1998), g^ is the approximate gravity field computed using one 



of three numerical methods (summation, FE, FMM) and fi is the model 
domain. 
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(a) Geometry 




(b) Gravity field 

Figure 4: Synthetic model used thorough out the numerical experiments, (a) 
Domain and density anomaly and (b) the corresponding analytic gravity field 
Qz (mGal). The inclusion is indicated by the transparent blue cube. See text 
for dimensions of the domain and density anomaly. 
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Figure 5: Convergence rate of the L\ norm for the gravity field computed 
using SUM-Gl and SUM-G2. 



3.1.1 Summation 

We computed the gravity component g^ with SUM-Gl and SUM-G2 using a 
number of meshes composed of M elements in each x, |/, z direction. The 
following grid sequence was used to measure the convergence rate, M = 
{12,24,48,96}. The side length of each element is given hy h = 600/M, 
hence for the mesh sequence used we have h = {50, 25, 12.5, 6.25} m. Given 
that SUM- AN employs an analytic solution for the gravity at a point due to 
hexahedral shaped density anomaly, the error expected is of machine preci- 
sion. Hence, we omit this method from the discussion of errors. The error 
in Eqs. (15), (16) and (17) was approximated via a 1-point quadrature rule 



over each hexahedral element in the mesh. The error Ei as a function of grid 
resolution is shown in Fig. [5J The convergence rate of gravity field in the 
discrete error measures Ei,E2, E^o is shown in Table [ij 

3.1.2 Finite element method 

The convergence behavior of the finite element methods fem-d and fem-gt 
was computed using the same grid sequence as in the summation test. Again, 
the mesh consisted of undeformed elements with Ax = Ay = Az = h. A 
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Table 1: Convergence rates obtained with the summation methods. 
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Figure 6: L2 error of the gravity field computed via fem-d and fem-gt. 



high order Gauss quadrature scheme was used to evaluate the error measures 
Ei,E2,E^. Details of how the error for the FE approaches was computed 
is provided in Appendix A. The L2 discretization error as a function of grid 
resolution h is shown in Fig. [6] The convergence rate of the gravity field in 
the discrete Ei,E2,Eoo norms is shown in Table |2} From these results, it 
immediately obvious that using the Robin boundary condition not only pro- 
duces smaller errors, but the FEM-GT method yields much higher convergence 
rates. 

To investigate sensitivity of the two boundary conditions used in the FE 
approaches to the size of the model domain, we performed another con- 
vergence test and varied the aspect ratio L/H, where model domain and 
anomaly length are denoted by L and H respectively. The anomaly size H 
was kept fixed at 600 m, whilst L was increased such that we had the fol- 
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Table 2: Convergence rates of the finite element methods for L/ H = 6. 



error 
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Figure 7: Convergence rate in L2 as a function of the domain size. 



lowing aspect ratios L/H = {3, 12, 18}. As in the other convergence tests, 
four meshes of increasing resolution were used. To keep the discretization 
errors comparable between the different models, we ensured that element 
size on each of the four meshes, for each L/H yielded element sizes of 
h = {50, 25, 12.5, 6.25} m. The L2 convergence rates are shown in Fig. [tI 
Here we see that the convergence rate of fem-gt is independent of the do- 
main size, whilst the convergence rate of the gravity field computed using 
FEM-D increases as the model domain increases. We expect that the rate 
from FEM-D approaches 1.0 as L/H — >■ cxo. 



Table 3: Convergence rates of PetFMM using different values of p. 



error 




P 






1 


4 8 


20 


El 


-2.58 


-0.66 1.45 


2.08 


E2 


-1.81 


0.08 1.51 


1.53 


Eoo 


-1.76 


-0.04 0.99 


0.99 



3.1.3 Fast multipole method 

The convergence rate of PetFMM was performed using the same mesh se- 
quence as in the summation experiments. As for the summation meth- 
ods, the error measures were approximated via a 1-point quadrature rule 
over each hexahedral element. The accuracy of the solution obtained via 
PetFMM is strongly related to the number of terms p used in the expansion 



of Eq. ( 14 ) . The measured convergence rate in the different norms are pre- 
sented for p = {1, 4, 8, 20} in Table |3] For the error measure Ei, we show the 
variation with grid resolution h in Fig. |8j Comparing with the rates from the 
summation methods from Table [T| we note that as p increases, the conver- 
gence rates of PetFMM approach those obtained using SUM-Gl and SUM-G2. 



3.2 Optimality (CPU time) 

Here we report the CPU time of the different numerical methods applied to 
the synthetic model described in Sec. [3j All timings reported were obtained 
with code compiled using GCC 4.4.3 with level three optimization and with 
an optimized build of the PETSc library. The timing runs were performed 
on Octopus, which is an 8-core Intel Xeon 2.67GHz (Nehalem) machine pos- 
sessing 64 GBytes of RAM. 

3.2.1 Summation 

On a given mesh, the time required for the summation methods is propor- 
tional to the number of locations where the gravity is evaluated. For this se- 
ries of tests, we evaluated the gravity on a regularly spaced array of 150 x 150 
points, located at the upper surface of the model domain. In Table |4], we re- 
port the total CPU time (sec) per gravity station on the following sequence of 
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Figure 8: Convergence rate of the L\ norm of the gravity field computed 
using PetFMM using different values of p. 



meshes, M = {6, 12,24,48,96}. Methods SUM-Gl and SUM-G2 compute the 
three components of the gravity vector, whilst SUM- AN and SUM-g1(2;) only 
compute the gravity field in the z direction. All methods possess an approx- 
imately linear relationship between the CPU time / station and the number 
of cells used to discretize the domain. Considering the one point quadra- 
ture rule methods, SUM-Gl is only a factor of 1.5 slower than sum-g1(2;). 
The slight increase in time required for SUM-Gl is a consequence of a more 
general quadrature. In this implementation, arbitrarily deformed hexahedral 
elements are permitted, whilst element edges were required to be perpendic- 
ular to the coordinate system in sum-g1(2;). Allowing the elements to be 
deformed requires that the integration be performed in a reference coordinate 
system, which thus requires the inverse Jacobian (coordinate transformation) 
to be evaluated. SUM-G2 was observed to be approximately 7 times slower 
than SUM-Gl, even though it employs 8 times as many quadrature points. 
The closed form method, SUM- AN is ~ 150 times slower than sum-g1(2;) and 
~ 13 times slower than SUM-G2. 
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Table 4: CPU time (sec) for the summation methods. The times reported are 
normahzed by the number of locations where the gravity field was evaluated. 
Here M is the number of cells used to discretize the subsurface in each 
direction and h is side length (m) of each cell. 

CPU time (sec) / station 
h (m) M sum-gl(2) sum-gl sum-g2 sum-an 

100 6 4.78e-07 7.36e-07 5.34e-06 7.39e-05 



50 


12 


3.71e-06 


5.72e-06 


4.29e-05 


5.50e-04 


25 


24 


2.98e-05 


4.56e-05 


3.40e-04 


4.39e-03 


12.5 


48 


2.37e-04 


3.65e-04 


2.68e-03 


3.76e-02 


6.25 


96 


1.90e-03 


2.92e-03 


2.13e-02 


2.83e-01 



3.2.2 Finite element method 

The FE calculations were performed using meshes consisting of M = {12, 24, 48, 96, 192, 384} 
elements in each direction. In all the calculations performed, the multigrid 
preconditioner used a coarse grid consisting of 6 x 6 x 6 elements. The num- 
ber of grid levels n^, was chosen to give the desired value of M on the finest 
grid level. 

In Table |5] we report the time required to perform the linear solve of the 
system in Eq. (|9]) using both fem-d and fem-gt. The time required for the 
solve represented more than 99% of the total execution time, thus only the 
solve time is reported. We observe that the number of iterations required by 
both methods are independent of the grid resolution. Furthermore, both the 
CPU time and memory usage scale approximately linearly with respect to 
the number of unknowns in the potential field, n = {M + 1)^. The solve time 
for the FEM-D method is slightly higher than that required by fem-gt. The 
difference in CPU time is attributed to the manner in which the Dirichlet 
boundary conditions were imposed during each application of the matrix free 
product. Ay. This particular operation could easily be further optimized in 
the future. 

3.2.3 Fast multipole method 

The performance of the PetFMM algorithm was measured using the same 
sequence of meshes as used in the FE approaches, i.e. the mesh contained 
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Table 5: Performance of the FE methods. The CPU time (sec) and the num- 
ber of iterations required by the Poisson solver are reported. The memory 
usage (MB) for fem-d and fem-gt are the same and are reported in the 
final column. 







fem-d 




fem-gt 






h (m) 


M 


CPU time (sec) 


Iter. 


CPU time (sec) 


Iter. 


Mem. (MB) 


50 


12 


1.17e-02 


8 


1.07e-01 


8 


< 10 


25 


24 


l.lOe+00 


9 


l.Ole+00 


9 


< 10 


12.5 


48 


8.87e+00 


9 


8.08e+00 


9 


4.00e+01 


6.25 


96 


7.11e+01 


9 


6.45e+01 


9 


2.85e+02 


3.13 


192 


5.67e+02 


9 


4.68e+02 


8 


2.20e+03 


1.56 


384 


4.15e+03 


8 


3.74e+03 


8 


1.70e+04 



M = {12,24,48,96,192,384} elements in each direction. The octree used 
to define the FMM data structure used k = 2^ cells along each axis, where 
L denotes the number of levels within the tree. For the mesh sequence 
used, we employed L = {2,3,4,5,6,7}. In these calculation presented, the 
gravity vector was computed at the centroid of each cell used to discretize the 
density field. In Table [6] we report the CPU time (sec) required to execute 
the PetFMM algorithm. The time required to evaluate the gravity field is 
negligible compared to time spend in the PetFMM algorithm and is thus not 
reported here. For these experiments, the graph partitioner ParMETis was 
used. 

For the sequence of meshes used in our test, an optimal FMM algorithm 
may be expected to yield execution times and memory usage requirements 
which increased by a factor of eight, for each increase in grid resolution. 
From Table [6} the memory usage is observed to follow this scaling. However, 
we note that the CPU time for PetFMM is observed to only approach the 
anticipated result as M increases. In Fig. M the solution time (solid thin 
line, left y-ax.is) and the solution time ratio for t^/tk-i (solid thick line, right 
y-axis), is plotted as a function of the number of elements in each direction 
M. The anticipated optimal value of tk/tk-i = 8 is denoted via the thin gray 
line. 

We can explain the deviation of this ratio observed with small numbers 
of voxels to a surface to volume effect. For a cube, divided into k pieces 
along each axis, we obviously have k^ small constituent cubes. Of these, 8 
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are corner cubes which have 7 neighbors. There are 12 edges of the large 
cube, each of which has k — 2 small cubes with 11 neighbors. Similarly, there 
are 6 faces of the large cube, each of which has {k — 2)^ small cubes with 17 
neighbors. The remaining {k — 2)^ interior cubes have 26 neighbors. We can 
check that the number of small cubes is correct, 

(A; - 2)^ + 6(A; - 2)^ + 12(A; - 2) + 8 (18) 

= (P - 6A;2 + 12A; - 8) + 6(A:2 - 4A; + 4) + 12(A; - 2) + 8 (19) 
= P. (20) 

If we assume that B particles are in every cube, then the direct work done 
per cube is given by 

W^ = ^^^^ + NsB' ^ (^Ns + B\ (21) 

where Nb is the number of cube neighbors. The ratio of work R, between a 
2k division compared to a fc division along each axis is given by. 



\k J 


{2k - 2)3f + 6(2A; - 2)2f + 12(2A; - 2)f + 8f 


(k - 2)3f + Q{k - 2)2f + 12(A; - 2)f + 8f ^^^^ 




53{2k - 2)3 + 210(2A; - 2f + 27Q{2k - 2) + 120 




53(A; - 2)3 + 210(A; - 2)2 + 276(A; - 2) + 120 ^ '' 




(8fc3 - 24A;2 ^ 2Ak - 8) + 3.96(4A;2 - 8A; + 4) + 5.21(2A; - 2) + 2^26 




(fc3 - 6P + I2k - 8) + 3.96(A;2 - 4fc + 4) + 5.21(A: - 2) + 2.26^'* 


= 


8fc3-8.16A;2 + 2.74A;-0.32 
fc3 - 2.04A;2 + 1.37A; - 0.32 ' ^ ' 




(26) 


In the first two tests considered in Table 6 , we have k = 2 and 4, thus 




^m^ 968.00 ^^^^3 

V2y 60.00 ^ ^ 


Even at A; = 


8 we have 




/16\ _ 95288.00 _ 




^ U ; 10392.00 - ^-^^ ^^^' 


and we can 


see that not inconsiderable surface-to-volume effects persist for 


larger octrees. The optimal ratio defined by Eq. (26) is denoted in Fig. 9 



via the dashed line. The agreement between the optimal and measure work 
scaling illustrated in Fig. |9] verify the optimality of the M2L-transformation. 
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Table 6: CPU time (sec) and memory usage (MB) for PetFMM with increasing 
grid resolution. In these calculations we used an expansion order oi p = 8. 
We note that the memory counter used in the implementation of PetFMM 
was not able to represent the number of bytes required for the case M = 384. 



h (m) 


M 


L 


CPU time (sec) 


Mem. (MB) 


50 


12 


2 


8.02e-02 


< l.OOe+00 


25 


24 


3 


1.19e+00 


1.67e+00 


12.5 


48 
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1.34e+01 


1.34e+01 


6.25 


96 


5 


1.27e+02 


1.07e+02 


3.13 
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6 


l.lle+03 


8.56e+02 


1.56 


384 


7 


9.33e+03 
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Figure 9: Computation time as a function of the number of density blobs 
M^ for PetFMM. The left y-axis denotes CPU time (sec) and right y-axis 
denotes the ratio of solution times between the current and previous grid 
resolution. For the grid sequence used, the asymptotic (hnear) scahng would 
yield a ratio of 8, here denoted via the dashed line. 
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3.3 Parallel scalability 

In order to measure the parallel performance of an algorithm, two types of 
studies are typically employed. The first measure considers weak scaling, in 
which a fixed number of unknowns per processor (i.e the work per processor) 
is kept constant and more processors are introduced. Thus the overall prob- 
lem size increases with the number of processors, but the work per process 
remains constant. Ideal weak scaling would yield a solution time which was 
independent of the number of processors which were employed. Alternatively, 
strong scaling considers a problem with a fixed number of unknowns which 
is solved using an increasing number of processors. Thus, the unknowns per 
processor decreases as the number of processors increases. Ideal strong scal- 
ing would yield a solution time which linearly decreases in proportion to the 
number of processors used to solve the problem. 

In the interest of developing fast algorithms for performing gravity inver- 
sions in a reduced amount of time, here we only consider the strong scalability 
of the three algorithms presented. If a simulation required to seconds on pi 
processors, the optimal time topt, on p2 > Pi processors is topt = ^0(^1/^2)- 
The parallel efficiency E of the strong scaling is measured according to 

E = 100 ( ^°P* ) , (29) 

\ T^measured J 

where tmeasured IS the measured time taken for the computation on p2 pro- 
cessors. All parallel results presented here were performed on the CADMOS 
IBM Blue Gene/P (http : //bluegene . epf 1 . ch) . 

3.3.1 Summation 

All of the summation algorithms considered here exploit parallelism by sub 
dividing the set of voxels used to represent the density structure amongst 
Up processors. The spatial decomposition of the mesh was defined by slicing 
the domain into Nx,Ny, Nz subdo mains such that Up = N^ x Ny x N^. The 
only communication required in our implementation is the global reduction 
(sum) of a vector of length equal to the number of evaluation points. Thus 
if the number of voxels in each processors subdomain is equal, the only de- 
parture from perfect strong scaling can be attributed to the single call to 
MPI_Allreduce. In Table [7] we report the CPU times obtained from using 
SUM-g1(2) with a model domain of 128^ voxels and 100^ evaluation points 
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Table 7: Strong scaling for SUM-Gl on CADMOS BG/P, using a mesh with 
128^ cells and 100^ evaluation points. Here Up indicates the number of pro- 
cessors used. (D) indicates the job was executed in DUAL mode, implying 
two processors per node were used. (V) indicates the job was launched in 
VN mode, in which all four processors per node were used. 





CPU time (sec) 


Up 


Total 


Reduction 


1 


1.0632e+03 


1.6999e-04 


8 


1.3922e+02 


1.2458e+01 


64 


1.8222e+01 


3.1192e+00 


128 


9.3883e+00 


2.0806e+00 


256 (D) 


4.8376e+00 


1.3013e+00 


512 (V) 


2.4939e+00 


7.8258e-01 


512 


2.4934e+00 


7.8200e-01 


2048 (V) 


7.0128e-01 


3.2787e-01 



which were regularly spaced in a horizontal plane located at the upper surface 
of the model. Both the CPU time for the total computation and the time for 
the global reduction are reported. We note the time for the global reduction 
does not exhibit perfect strong scaling for this set of experiments. Accord- 
ingly, when the time required to perform the evaluation and local sum of the 
gravity contributions is much larger than the time required for the reduction, 
excellent scalability is observed {up < 256). When this time is comparable 
with the cost of the reduction, the sub optimal scaling of the reduction will 
become significant and deteoriate the scaling of the total execution time. 
Comparing the total CPU times for n^ = 8 and 2048, we observe a parallel 
efficiency oi E ^ 78%. 



3.3.2 Finite element method 

The success of a parallel multigrid is largely dependent on the type of coarse 
grid solver used. We consider a direct extension of sequential multigrid al- 
gorithms which employ a direct solver on the coarsest grid level. The direct 
solve on the coarse grid was performed in parallel using either the multi- 



frontal method MUMPS (Amestoy et al., 2001), or by TFS (Tufo and Fis- 



cher, 2001). MUMPS is a general purpose parallel direct solver, whilst TFS 
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is specifically designed for matrix problems in which a processors subdomain 
contains very few degrees of freedom (as is the case on our distributed coarse 
grid). TFS has the limitation that the number of processors must be a power 
of two. 

To examine the strong scalability, we considered two experiments in which 
the fine grid contained either 256^ elements of 512^ elements. The coarse grid 
was defined via Mc elements in each direction. Both experiments used six 
grid levels, with Mc being 8 and 16 respectively. In our geometric multi- 
grid implementation, we require for a given grid, that each processor's local 
subdomain must contain at least one element. Accordingly, the number of 
elements in the coarse grid thus places an upper limit on the maximum num- 
ber of CPU's we can use. The results of the strong scalability are shown in 



Fig. [TOj The scalability of fem-d and fem-gt are expected to be identical 
so only the results of FEM-D are presented. The measured parallel efficiency 
on 512 CPUs was E ^ 90% for the problem using MUMPS and E ^ 68% on 
2048 CPUs for the problem employing TFS. 

3.3.3 Fast multipole method 

To examine the strong scalability of PetFMM, we considered three different 
meshes with M^ elements where M = {96, 192, 384}. For a given number of 
input density values, there is a number of levels Ly which minimizes the to- 
tal computation time. As in the multigrid implementations, they are certain 
restrictions upon the number of CPU's (rzp) which can be used with PetFMM. 
The primary constraint is on the number of local trees in the spatial decom- 
position. The number of local trees Nt is given by 2°'^''', where d = 3 is 
the spatial dimension and ri is the root level of the tree. For efficiency, it 
is required that Nt > Up, so that at least one tree is distributed to every 
process. 

For the parallel runs presented here, the simple geometric based parti- 
tioning algorithm was used to balance load and communication. The total 
execution times are reported in Table [8j 

The strong scaling efficiency is observed to decrease as the number of 
processors used increases and also as the root level increases. To better 
understand the reason for this scaling behavior, we examined the scalability of 
individual components within the PetFMM implementation. The breakdown 
of CPU times for the Up = {512 — 4096} series of jobs is shown in Fig 
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The downward sweep event involves both a parallel operation (indicated by 
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Figure 10: Strong scaling on the CADMOS BG/P for two different resolution 
FEM-D simulations. The optimal time is indicated by the dashed line. The 
coarse grid consisted of Mc elements in each direction and each model used 
111 = 6 levels. Two different coarse grid solvers, MUMPS and TFS were 
employed (see text for further details). 
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Table 8: Strong scaling of PetFMM on CADMOS BG/P. The times reported 
here represent the total time taken to perform the multipole summation 
(ParaFMMEvaluate). (S) denotes -mode SMP, (D) denotes -mode DUAL, 
(V) denotes -mode VN. * indicates efficiency was computed w.r.t the 64 CPU 
execution time (pi = 64). 



Up 

8 
16 
32 
64 

32 

64 
128 
256 
512 

512 
1024 
2048 
4096 



ri L^ M 



4 96 



192 



6 384 



I!PU time (sec) 


Efficiency 


3.9740e+02 (S) 


_ 


2.0950e+02 (S) 


95% 


1.1086e+02 (S) 


90% 


5.9088e+01 (D) 


84% 


9.2118e+02 (S) 


~5~ 


4.8627e+02 (D) 


95%, - 


2.5809e+02 (S) 


89%, 94%* 


1.4380e+02 (D) 


80%, 85%* 


8.6693e+01 (V) 


66%, 70%* 


7.8231e+02 (V) 


_ 


5.5052e+02 (D) 


71% 


4.3421e+02 (D) 


45% 


3.7705e+02 (V) 


26% 



'DownSweep" in Fig. 11) and a sequential operation at the root level of the 



tree (indicated by "Root Tree DownSweep" in Fig. 11). Thus, if the time 



required for the sequential operation is large compared to the time spent in 
evaluating contributions from the local parts of the tree, strong scalability 
will obviously suffer. The local calculations are all observed to strong scale 
well, however as the subdomains become smaller, the cost of the root tree 
will eventually dominate the overall execution time and reduce the parallel 
efficiency. In our experiments, the cost of the root tree evaluation grows by 
a factor of eight each time ri is increased by one. To offset the increasing 
cost of root level calculation, i.e. to observe better strong scalability, one can 
easily introduce work on each subdomain by increasing M. 



29 



1000 




512 1024 2048 

Number of processors 



4096 



Figure 11: Breakdown of the strong scalability of the individual PetFMM 
components. Note that not all components listed in the legend are visible in 
the bar chart as they represent a very small fraction of the total execution 
time. 
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4 Discussion 



In the experiments described in Sec. 3.1, the discretization error of the three 



methods was examined. In the norms measured, the convergence rates ob- 
tained using SUM-Gl and SUM-G2 were nearly identical. A measurable dif- 
ference in the absolute error between the different quadrature rules was ob- 
served, with SUM-Gl yielding errors approximately 2.3 times larger than SUM- 
g2. The rates measured between the two summation methods using Gauss 
quadrature and the rates obtained using PetFMM, were extremely similar, 
provided the expansion order p was high enough. In the cases where p < 4:, 
sub-optimal convergence (-E'2), or divergence was observed {Ei,Eoo). 

Both PetFMM and the summation methods incorporate the analytic solu- 
tion of the potential (or gravity) within the discretization, thus these methods 
naturally satisfy the boundary condition, = 0, as x — )■ 00. Within the FE 
methods considered here, this boundary condition was approximated. The 
convergence behavior of the gravity field obtained using finite element meth- 
ods is thus likely to be dependent on the choice of approximation made. 
In the absence of any boundary condition approximation and any approx- 
imations in defining the density structure, we anticipate the gravity error 
computed with Qi elements to behave like 



,hl 



2 



< ci/i, (30) 



where Ci is a constant independent of the grid resolution h. In the case of 
FEM-D, the boundary condition approximation is seen to limit how close the 
discrete solution will approximate the exact solution. Since the approximate 
boundary condition doesn't approach the true boundary condition in the 
limit of /i — )■ 0, the convergence of rate of the potential and gravity field will 
ultimately deteriorate with increasing grid resolution. That is we have. 



>i 



12 



<Cih + C2{j]. (31) 



This type of relationship is evident in Fig. [6] where we observe a low corre- 
lation between the straight line with slope 0.57 and the measured error. In 
practice this effect can be reduced if we ensure that the model domain is 
significantly larger than the domain defining the density anomaly, thereby 
making the coefficient C2 smaller. However, adopting this approach intro- 
duces significantly higher computational requirements. 

31 



Table 9: Cross over point between the summation algorithms and the PDE 
based approaches. The rows marked with the * indicate that the summation 
times were estimated from the summation simulation with M = 96. Columns 
4-7 indicate the number of evaluation points below which the summation 
algorithms are faster than FEM-GT and PetFMM. 



M CPU time (sec) / station 
sum-gl(2) sum-an 



fem-gt petfmm 

sum-gl(2) sum-an sum-gl(z) sum-an 



12 


3.71e-06 


5.50e-04 


2.88e+04 


1.95e+02 


2.16e+04 


1.46e+02 


24 


2.98e-05 


4.39e-03 


3.39e+04 


2.30e+02 


3.99e+04 


2.71e+02 


48 


2.37e-04 


3.76e-02 


3.41e+04 


2.15e+02 


5.65e+04 


3.56e+02 


96 


1.90e-03 


2.83e-01 


3.39e+04 


2.28e+02 


6.68e+04 


4.49e+02 


192* 


1.52e-02 


2.26e+00 


3.08e+04 


2.07e+02 


7.30e+04 


4.90e+02 


384* 


1.22e-01 


1.81e+01 


3.08e+04 


2.06e+02 


7.67e+04 


5.15e+02 



On the contrary, the alternative boundary condition approximation used 
in FEM-GT does not appear to place a bound on the minimum discretization 
error possible on a finite sized domain. This is apparent from Fig. |6] where 
a high correlation between the grid size and discretization error is observed. 
This suggests that the Robin boundary approximation converges like 0{h) 
as the mesh is refined, since we observe the first order convergence predicted 



from Eq. (|30|) in the gravity field and this convergence rate appears to be 

Nevertheless, 



independent of the domain aspect ratio L/H (See Fig. [7|) 

despite the improved convergence rate of fem-gt, the rates observed are 

lower than those obtained using either the summation methods or PetFMM. 

To assess the speed of the three methods examined, we consider defining 
the cross over point where the summation methods cease to be less effi- 
cient than either fem-gt and PetFMM. The cross over point occurs when 
the number of evaluation points exceeds tFEM-GT,petFMM/4um, where tsum is the 
time per evaluation point obtained from one of the summation algorithms. 
The number of evaluation points required to reach the cross over point for 
the sequential results are presented in Table. |9j We note that the times from 
Table |4] are repeated in the second and third column. The summation meth- 
ods were not run at a grid resolution of M = {192, 384}, therefore the time 
required for the summation algorithms was estimated from the time required 
by the M = 96 case and scaling this value by 8 and 64 respectively. 

All the three methods were observed to exhibit good strong scaling up to 
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1024 CPUs. By far the easiest method to obtain good parallel scalability was 
the summation methods. This is simply due to the lack of algorithmic com- 
plexity in the direct summation approach. Scalability here is only limited by 
the network of the computer cluster used. Our tests were performed on an 
IBM Blue Gene/P, which is known to have an excellent network with spe- 
cialized hardware for performing global reductions. The techniques used by 
FEM-D,FEM-GT and PetFMM are more difficult to obtain high strong scaling 
efficiency. In the context of the multigrid preconditioner, this was due to the 
design choice that the mesh on the coarse grid had to be distributed and that 
we required at least one element per CPU. This particular restriction could be 
relaxed if a different coarse grid solver was employed. For example, we could 
use a large coarse grid, use less levels in the preconditioner and employ an 
exact coarse grid solve using an algebraic multigrid (AMG) preconditioner. 
The AMG algorithms are useful in this context as they do not require any 
geometric information to determine how the work will be distributed across 
the CPUs. With PetFMM, speedup was measured up to 4096 CPUs, how- 
ever the measure efficiency was only 26%. Strong scaling with PetFMM is 
hindered by the sequential calculations which have to be performed at the 
root level. This is a typical bottleneck in FMM algorithms, however it could 
be eliminated by overlapping the root tree computation with the local direct 
summation work. This will be the object of future research. 

Lastly we consider the overall usability of the different methods for com- 
puting gravity anomalies from the perspective of an end user. The quadrature 
based methods are by far the easiest method to use. It permits complete geo- 
metric freedom in defining the underlying grid which is used to discretize the 
density field. No connectivity is required between the cells and the vertices. 
The only requirement is that the cells used to partition the domain defined 
by the density anomaly do not overlap. Consequently, topography, curvature 
and locally refined regions are easily introduced. In the method described 
here, a constant density was used within each cell. This is not strictly nec- 
essary and spatial variations of density within a cell are possible, however 
the order of the quadrature rule used would likely have to be increased to 
maintain the accuracy of the method. 

To use the geometric multigrid, a mesh hierarchy is required. Here we 
considered nested hierarchies of structured meshes. With such a topology, 
generating a mesh which has element faces which conform to all the jumps 
in density may be difficult to construct. This could be partially alleviated 
by using an unstructured mesh, but fully unstructured meshing in parallel is 
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still a challenging task. Furthermore, an unstructured mesh hierarchy would 
also be required to be generated. The convergence, and hence the CPU 
time required by the geometric multigrid method is strongly dependent on 
the mesh geometry. For example, the implementation described here ceases 
to be robust if the mesh possesses a high aspect ratio, or the elements are 
highly deformed. In such circumstances, stronger smoothers are required if 
rapid convergence is to be maintained. Stronger smoothers may for exam- 
ple include block Jacobi with ILU factorization defined on the sub-blocks. 
Such choices mandate additional storage and careful selection and tuning of 
smoothers to remain optimal. To some extent, many of the aforementioned 
disadvantages related to geometric restriction introduced by using GMG can 
be overcome using algebraic multigrid (AMG). AMG preconditioners require 
the stiffness matrix to be assembled and furthermore, maintaing both scal- 
able and optimal solution times in parallel is still a challenge with these 
approaches. The FE approaches does have the advantage that continuous 
density variations can be naturally introduce throughout the element. 

The FMM does not possess any geometric restrictions in how the density 
structure may be defined. Whilst structured grids were used here, FMM can 
in principal be used with a random point distribution which define the loca- 
tion and value of density in space. In the case when a random distribution of 
points is used, one also needs to provide the volume of the domain which is 
associated to each point. This can be readily computed using a Voronoi dia- 
gram, or preferentially in parallel calculations using an approximate Voronoi 
diagram. Thus the method provides high geometric fidelity without having 
the burden of creating a mesh, conforming or otherwise. The time required 
to compute the gravity signal is a function of the number of points used to 
discretize the density, and not dependent on their spatial distribution. The 
convergence of FMM could be improved by introducing a basis with more 
smoothness than the current delta function discretization. In future work, we 
will introduce a Gaussian basis for rock masses so that the convergence rate 
can be adjusted by varying the width of the Gaussians. The initial interpola- 



tion problem for this new basis will be solved using the PetRBF code ( Yokota 



et al.||2010D . 

Despite being more than two times slower than fem-gt, we believe that 
the geometric flexibility permitted in defining the density structure, com- 
bined with fact that the solve time is independent of the geometry of the 
discretization used for the density structure, make PetFMM more useful in 
applied geophysics studies. PetFMM was shown to be comparable in accuracy 
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to the SUM-G2 algorithm and should be used preferentially over this method 
if the number of evaluation points exceeds 77 x 10^. 

5 Conclusion 

Fast and robust forward models for computing a gravity signal from a den- 
sity distribution is essential to perform high resolution inversions of the den- 
sity subsurface. Here we have discussed three different forward modes for 
computing gravity anomalies and compared them based on the convergence 
rates of the obtained gravity field, the execution time required to evaluate 
the gravity field and the parallel scalability of the algorithms. We considered 
classical summation techniques based on closed form expressions or quadra- 
ture schemes, and optimal and scalable approaches suitable for solving the 
Poisson equation. The PDE based approaches consisted of a finite element 
discretization utilizing a geometric multigrid preconditioner and an imple- 
mentation of the fast multipole method. 

The summation methods employing quadrature approximations are found 
to yield results of comparable accuracy to FMM. Only the finite element 
method which incorporated a far-field gravitational approximation in the 
form of a Robin boundary condition was deemed to be useful in practice. 
The error incurred by specifying a vanishing potential on the boundary of 
a finite domain resulted in large errors, and low convergence rates in the 
gravity field. All the forward models demonstrated good strong scaling up to 
1024 CPUs. The fast multipole method presents itself as a viable alternative 
to classical summation methods due to the geometric freedom in defining the 
density structure and insensitivity of the overall CPU time to the underlying 
density structure. In comparison to the summation algorithm employing 
analytic expression for the gravity, FMM is faster provided more than 515 
evaluation points are used. If the simplest quadrature based summation 
algorithm is used, FMM will provide a faster forward model if more than 
77 X 10^ evaluation points are used. 
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A Error Evaluation 



Here we discuss the method used to evaluate errors defined in Eqs. (15), (16) 



and (17). The spatial variation of the discrete solution for the gravity field g^ 
is defined by the representation natural to discretization. For the summation 
and FMM, this means g'^ is represented via piecewise constant over each cell. 
For the FE methods, g^ is represented via a bilinear function g'^ = aQ + aiX + 
0'2y + dsxy, since the potential was discretized via trilinear basis functions. 



The integrals in Eq. (15),(16) were approximate via Gauss quadrature. The 
order of the quadrature used was determined empirically. The complexity 
of the analytic solution was such that low order rules were not appropriate 
to accurately estimate the norm. Over each cell in the discretization, we 
found that a 4-point quadrature rule, applied over m x m x m subdivision 
(in each x, y, z direction respectively) of each cell was sufficiently accurate. 
The value for m was obtained by evaluating ||5'z||i, ||f72||2,-Eoo and examining 
how the error norm varied with m. The results from the experiment used 



to determine the value of m for each M are presented in Table 10 The 
final value of m shown for each M was used to calculate the norms in our 
experiments. The same quadrature rule used to evaluate Ei, E2 was used to 
evaluate Enr,. 
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Table 10: Results of the gravity quadrature test. Estimated values of the in- 
tegral of the analytic gravity field obtained using a 4-point Gauss quadrature 
scheme, with different numbers of integration regions m^, within each cell. 



M 



m 



\9z 



\9z 



E., 



12 3 2.686359587701e+02 

4 2.686359587702e+02 

5 2.686359587700e+02 

6 2.686359587701e+02 

7 2.686359587703e+02 

8 2.686359587699e+02 



3.461398542186e-02 
3.461399254307e-02 
3.461399453156e-02 
3.461399525775e-02 
3.461399557324e-02 
3.461399572806e-02 



3.381867068310e-05 
3.403021478492e-05 
3.415713959000e-05 
3.424175549691e-05 
3.430219515320e-05 
3.434752475888e-05 



24 2 2.686359587702e+02 

3 2.686359587702e+02 

4 2.686359587698e+02 



3.461399254307e-02 
3.461399525775e-02 
3.461399572806e-02 



3.403021478492e-05 
3.424175549691e-05 
3.434752475888e-05 



48 1 2.686359587702e+02 3.461399254307e-02 3.403021478492e-05 
2 2.686359587698e+02 3.461399572806e-02 3.434752475888e-05 



96 



2.686359587698e+02 3.461399572806e-02 3.434752475888e-05 
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